Sixt Data Science Lab - Test Task for Data Scientist Job Candidates¶

Introduction¶

In this test task you will have an opportunity to demonstrate your skills of a Data Scientist from various angles - processing data, analyzing and vizalizing it, finding insights, applying predictive techniques and explaining your reasoning about it.

The task is based around a bike sharing dataset openly available at UCI Machine Learning Repository [1].

Please go through the steps below, build up the necessary code and comment on your choices.

Part 1 - Data Loading and Environment Preparation¶

Tasks:

  1. Prepare a Python 3 virtual environment (with virtualenv command). requirements.txt output of pip freeze command should be included as part of your submission.
  2. Load the data from UCI Repository and put it into the same folder with the notebook. The link to it is https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset . Here is an available mirror in case the above website is down: https://data.world/uci/bike-sharing-dataset
  3. We split the data into two parts. One dataset containing the last 30 days and one dataset with the rest.

Setup Environment

Add utils¶

In [55]:
import os, sys
os.getcwd()

# subfolders
print(os.listdir("data"))
print(os.listdir("output"))
Out[55]:
'/mnt/N0326018/project'
['day.csv', '.ipynb_checkpoints', 'Readme-Data.txt', 'hour.csv']
['analyze_dataset.html', '.ipynb_checkpoints', 'analyze_dataset_comparison.html', 'sample_submission.csv']

Confirm virtual environment¶

In [56]:
print(sys.prefix)
print(sys.executable)
/mnt/N0326018/project/.venv
/mnt/N0326018/project/.venv/bin/python

Libraries¶

In [57]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from scipy import stats
import seaborn as sns
import sweetviz as sv
from scipy.stats import scoreatpercentile
from statsmodels.graphics.gofplots import qqplot
import time
import math

from sklearn import preprocessing, metrics, linear_model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, StratifiedKFold,  cross_val_score, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
import pickle

import warnings
warnings.filterwarnings('ignore')

Setup info¶

In [58]:
# Config display options
pd.options.display.max_colwidth = 10000
pd.options.display.float_format = '{:.2f}'.format

# Display all outputs in Jupyter Notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# I want pandas to show all columns and up to * rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 1000)

# Environment for images
# This sets reasonable defaults for font size for
# a figure that will go in a notebook
sns.set_context("notebook")

# Set the font to be serif, rather than sans
sns.set(font='serif')

# Make the background white, and specify the
# specific font family
sns.set_style("whitegrid")

Load Data

In [59]:
# read raw training data
df_all = pd.read_csv('data/day.csv')
df_hour_all = pd.read_csv('data/hour.csv')

# split dataset
df_last30 = df_all.tail(30) # use to test data in unseen data
df = df_all.iloc[:-30, :] # use to train data

df.head()
df.shape
Out[59]:
instant dteday season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 6 0 2 0.34 0.36 0.81 0.16 331 654 985
1 2 2011-01-02 1 0 1 0 0 0 2 0.36 0.35 0.70 0.25 131 670 801
2 3 2011-01-03 1 0 1 0 1 1 1 0.20 0.19 0.44 0.25 120 1229 1349
3 4 2011-01-04 1 0 1 0 2 1 1 0.20 0.21 0.59 0.16 108 1454 1562
4 5 2011-01-05 1 0 1 0 3 1 1 0.23 0.23 0.44 0.19 82 1518 1600
Out[59]:
(701, 16)

Part 2 - Data Processing and Analysis¶

Tasks:

  1. Perform all needed steps to load and clean the data. Please comment the major steps of your code.
  2. Visualise rentals of bikes per day.
  3. Assume that each bike has exactly maximum 12 rentals per day.
    • Find the maximum number of bicycles nmax that was needed in any one day. answer here
    • Find the 95%-percentile of bicycles n95 that was needed in any one day. answer here
  4. Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)

Dataset characteristics

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

- instant: record index
- dteday : date
- season : season (1:springer, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
+ weathersit : 
	- 1: Clear, Few clouds, Partly cloudy, Partly cloudy
	- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
	- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
	- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered

Storytelling

EDA - Understanding Data

In [60]:
df_all.info()
df_all.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB
Out[60]:
instant season yr mnth holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
count 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00 731.00
mean 366.00 2.50 0.50 6.52 0.03 3.00 0.68 1.40 0.50 0.47 0.63 0.19 848.18 3656.17 4504.35
std 211.17 1.11 0.50 3.45 0.17 2.00 0.47 0.54 0.18 0.16 0.14 0.08 686.62 1560.26 1937.21
min 1.00 1.00 0.00 1.00 0.00 0.00 0.00 1.00 0.06 0.08 0.00 0.02 2.00 20.00 22.00
25% 183.50 2.00 0.00 4.00 0.00 1.00 0.00 1.00 0.34 0.34 0.52 0.13 315.50 2497.00 3152.00
50% 366.00 3.00 1.00 7.00 0.00 3.00 1.00 1.00 0.50 0.49 0.63 0.18 713.00 3662.00 4548.00
75% 548.50 3.00 1.00 10.00 0.00 5.00 1.00 2.00 0.66 0.61 0.73 0.23 1096.00 4776.50 5956.00
max 731.00 4.00 1.00 12.00 1.00 6.00 1.00 3.00 0.86 0.84 0.97 0.51 3410.00 6946.00 8714.00

Report Overview¶

In [61]:
# Report about total dataset, with target feature
eda_report = sv.analyze([df_all,'Bike Rentals'], 'cnt')
eda_report.show_html('output/analyze_dataset.html')
eda_report.show_notebook(layout='widescreen', w=1500, h=700, scale=0.7)
                                             |          | [  0%]   00:00 -> (? left)
Report output/analyze_dataset.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

Comments / reasoning:

  • I observe a strong positive correlation between registered and total rents
  • Casual users have a moderate correlation with total rents
  • We have less total rents during Spring months (more rentals during Summer/Autumn) and we observe much more rentals during holidays, since it is really conducive to ride bike in that season and moment of the week. Therefore June, July, August and September has relatively higher demand for bicycle.
  • Users prefer to rent bikes when windspeed is < 0.2 (low windspeed), however there's an interesting pike when windspeed is around 0.45
  • Humidity only has a negotive influence for bike rentals when is < 0.8 (people not comfortable with sweating too much...)
  • When the temperature is higher we observe higher rentals
  • There's a great increase in bike rentals from 2011 to 2012
In [62]:
# Report comparison about split datasets, with target feature
eda_report_comparison = sv.compare([df_all, 'training data'], [df_last30, 'last 30 days'], 'cnt')
eda_report_comparison.show_html('output/analyze_dataset_comparison.html')
eda_report_comparison.show_notebook(layout='widescreen', w=1500, h=700, scale=0.7)
                                             |          | [  0%]   00:00 -> (? left)
Report output/analyze_dataset_comparison.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

Comments / reasoning:

  • Different behaviour bike rentals during weekdays, last 30 days dataset has more rentals on Wednesdays and very low on Mondays (if 0 = Monday, 1 = Tuesday, 2 = Wednesday, 3 = Thursday, 4 = Friday, 5 = Saturday, 6 = Sunday)
  • From last 30 days dataset I observe that windspeed impacted more rentals when between 0.2 and 0.3.

EDA - Cleaning Data

  • Remove one of the temperature variables, because they are highly correlated
In [63]:
# rename columns
df_all.rename(columns={'instant':'id','dteday':'datetime','yr':'year','mnth':'month','weathersit':'weather_condition',
                       'temp':'temperature', 'atemp':'feel_temperature', 'hum':'humidity','cnt':'total_count'},inplace=True)
df_all.head()
df_all.dtypes
Out[63]:
id datetime season year month holiday weekday workingday weather_condition temperature feel_temperature humidity windspeed casual registered total_count
0 1 2011-01-01 1 0 1 0 6 0 2 0.34 0.36 0.81 0.16 331 654 985
1 2 2011-01-02 1 0 1 0 0 0 2 0.36 0.35 0.70 0.25 131 670 801
2 3 2011-01-03 1 0 1 0 1 1 1 0.20 0.19 0.44 0.25 120 1229 1349
3 4 2011-01-04 1 0 1 0 2 1 1 0.20 0.21 0.59 0.16 108 1454 1562
4 5 2011-01-05 1 0 1 0 3 1 1 0.23 0.23 0.44 0.19 82 1518 1600
Out[63]:
id                     int64
datetime              object
season                 int64
year                   int64
month                  int64
holiday                int64
weekday                int64
workingday             int64
weather_condition      int64
temperature          float64
feel_temperature     float64
humidity             float64
windspeed            float64
casual                 int64
registered             int64
total_count            int64
dtype: object
In [64]:
df_all['datetime']=pd.to_datetime(df_all.datetime)
df_all['season']=df_all.season.astype('category')
df_all['year']=df_all.year.astype('category')
df_all['month']=df_all.month.astype('category')
df_all['holiday']=df_all.holiday.astype('category')
df_all['weekday']=df_all.weekday.astype('category')
df_all['workingday']=df_all.workingday.astype('category')
df_all['weather_condition']=df_all.weather_condition.astype('category')

df_all.dtypes
Out[64]:
id                            int64
datetime             datetime64[ns]
season                     category
year                       category
month                      category
holiday                    category
weekday                    category
workingday                 category
weather_condition          category
temperature                 float64
feel_temperature            float64
humidity                    float64
windspeed                   float64
casual                        int64
registered                    int64
total_count                   int64
dtype: object
  • Missings
In [65]:
print('df_all shape : ' + str(df_all.shape))

# Split dataframe by numerical and categorical columns
num_df = df_all.select_dtypes(include = ['int64', 'float64'])
cat_df = df_all.select_dtypes(include = ['object', 'bool'])

# Get list of columns with missing values
missing_num = num_df.isnull().sum()
columns_with_missing_num = missing_num[missing_num > 0]
print("**These are the NUMERIC columns with missing values:**\n{} \n"\
      .format(columns_with_missing_num))

# Get list of columns with missing values
missing_cat = cat_df.isnull().sum()
columns_with_missing_cat = (missing_cat[(missing_cat > 0) & (missing_cat < len(df_all))])
print("**These are the CATEGORICAL columns with missing values:**\n{} \n"\
      .format(columns_with_missing_cat))

columns_with_all_missing_num = missing_num[missing_num == len(df_all)]
columns_with_all_missing_num = list(columns_with_all_missing_num.index)
print("**These are the NUMERICAL columns with ALL missing values:**\n{} \n"\
      .format(columns_with_all_missing_num))

columns_with_all_missing_cat = missing_cat[missing_cat == len(df_all)]
columns_with_all_missing_cat = list(columns_with_all_missing_cat.index)
print("**These are the CATEGORICAL columns with ALL missing values:**\n{}"\
      .format(columns_with_all_missing_cat))

df_all.drop(columns_with_all_missing_num, axis = 1, inplace = True)
df_all.drop(columns_with_all_missing_cat, axis = 1, inplace = True)
df_all shape : (731, 16)
**These are the NUMERIC columns with missing values:**
Series([], dtype: int64) 

**These are the CATEGORICAL columns with missing values:**
Series([], dtype: float64) 

**These are the NUMERICAL columns with ALL missing values:**
[] 

**These are the CATEGORICAL columns with ALL missing values:**
[]
  • Handling High Cardinality in Categorical columns
In [66]:
# Considering cardinality_threshold
cardinality_threshold = 10

# Get list of columns with their cardinality - don't want to consider numeric columns
categorical_columns = list(df_all.select_dtypes(exclude=[np.number]).columns)
cardinality = df_all[categorical_columns].apply(pd.Series.nunique)
columns_too_high_cardinality = list(cardinality[cardinality > cardinality_threshold].index)
print("There are {} columns with high cardinality. Threshold: {} categories."\
      .format(len(columns_too_high_cardinality), cardinality_threshold))
columns_too_high_cardinality
There are 2 columns with high cardinality. Threshold: 10 categories.
Out[66]:
['datetime', 'month']
  • Remove ID & unnecessary columns
In [67]:
ID_variables = ['id']

df_all.drop(ID_variables, axis = 1, inplace = True)
In [68]:
# casual & registered – These variables cannot be predicted
no_value_variables = ['casual', 'registered']

df_all.drop(no_value_variables, axis = 1, inplace = True)
  • Remove constant columns
In [69]:
# get list of columns with constant value
columns_constant = list(df_all.columns[df_all.nunique() <= 1])
print("There are {} columns with constant values".format(len(columns_constant)))
columns_constant

df_all.drop(columns_constant, axis = 1, inplace = True)
There are 0 columns with constant values
Out[69]:
[]
  • Remove perfect correlated
In [70]:
corr_matrix = df_all.select_dtypes(exclude=[np.object]).corr().abs()

# select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# find features with correlation > 0.95
columns_perfect_correlation = [column for column in upper.columns if any(upper[column] > 0.95)]
print("There are {} columns that are perfectly correlated with other columns: {} "\
          .format(len(columns_perfect_correlation), columns_perfect_correlation))

columns_perfect_correlation
There are 1 columns that are perfectly correlated with other columns: ['feel_temperature'] 
Out[70]:
['feel_temperature']
In [71]:
# I prefer to drop 'temp' the column atemp is more appropriate for modelling purposes, from human perspective
df_all.drop('temperature', axis=1, inplace=True)
  • Outliers
In [72]:
fig,ax = plt.subplots(figsize = (5, 3) )

# boxplot for total_count outliers
sns.boxplot(data = df_all[['total_count']])
ax.set_title('total_count outliers')
plt.show()
Out[72]:
<AxesSubplot:>
Out[72]:
Text(0.5, 1.0, 'total_count outliers')
In [73]:
# plot box plot of categorical variables

plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'total_count', data = df_all)
plt.subplot(3,3,2)
sns.boxplot(x = 'year', y = 'total_count', data = df_all)
plt.subplot(3,3,3)
sns.boxplot(x = 'month', y = 'total_count', data = df_all)
plt.subplot(3,3,4)
sns.boxplot(x = 'holiday', y = 'total_count', data = df_all)
plt.subplot(3,3,5)
sns.boxplot(x = 'weekday', y = 'total_count', data = df_all)
plt.subplot(3,3,6)
sns.boxplot(x = 'workingday', y = 'total_count', data = df_all)
plt.subplot(3,3,7)
sns.boxplot(x = 'weather_condition', y = 'total_count', data = df_all)
plt.show()
Out[73]:
<Figure size 2000x1200 with 0 Axes>
Out[73]:
<AxesSubplot:>
Out[73]:
<AxesSubplot:xlabel='season', ylabel='total_count'>
Out[73]:
<AxesSubplot:>
Out[73]:
<AxesSubplot:xlabel='year', ylabel='total_count'>
Out[73]:
<AxesSubplot:>
Out[73]:
<AxesSubplot:xlabel='month', ylabel='total_count'>
Out[73]:
<AxesSubplot:>
Out[73]:
<AxesSubplot:xlabel='holiday', ylabel='total_count'>
Out[73]:
<AxesSubplot:>
Out[73]:
<AxesSubplot:xlabel='weekday', ylabel='total_count'>
Out[73]:
<AxesSubplot:>
Out[73]:
<AxesSubplot:xlabel='workingday', ylabel='total_count'>
Out[73]:
<AxesSubplot:>
Out[73]:
<AxesSubplot:xlabel='weather_condition', ylabel='total_count'>
In [74]:
# plot box plot of continuous variables

plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
plt.boxplot(df_all["feel_temperature"])
plt.subplot(2,3,3)
plt.boxplot(df_all["humidity"])
plt.subplot(2,3,4)
plt.boxplot(df_all["windspeed"])
plt.show()
Out[74]:
<Figure size 2000x1200 with 0 Axes>
Out[74]:
<AxesSubplot:>
Out[74]:
{'whiskers': [<matplotlib.lines.Line2D at 0x7f61db51ecd0>,
  <matplotlib.lines.Line2D at 0x7f61db51efa0>],
 'caps': [<matplotlib.lines.Line2D at 0x7f61db2132b0>,
  <matplotlib.lines.Line2D at 0x7f61db213580>],
 'boxes': [<matplotlib.lines.Line2D at 0x7f61db51ea00>],
 'medians': [<matplotlib.lines.Line2D at 0x7f61db213850>],
 'fliers': [<matplotlib.lines.Line2D at 0x7f61db213b20>],
 'means': []}
Out[74]:
<AxesSubplot:>
Out[74]:
{'whiskers': [<matplotlib.lines.Line2D at 0x7f61dae14c40>,
  <matplotlib.lines.Line2D at 0x7f61dae14f10>],
 'caps': [<matplotlib.lines.Line2D at 0x7f61dae07220>,
  <matplotlib.lines.Line2D at 0x7f61dae074f0>],
 'boxes': [<matplotlib.lines.Line2D at 0x7f61dae14970>],
 'medians': [<matplotlib.lines.Line2D at 0x7f61dae077c0>],
 'fliers': [<matplotlib.lines.Line2D at 0x7f61dae07a90>],
 'means': []}
Out[74]:
<AxesSubplot:>
Out[74]:
{'whiskers': [<matplotlib.lines.Line2D at 0x7f61e0d42ac0>,
  <matplotlib.lines.Line2D at 0x7f61e0d42d90>],
 'caps': [<matplotlib.lines.Line2D at 0x7f617c1a60a0>,
  <matplotlib.lines.Line2D at 0x7f617c1a6370>],
 'boxes': [<matplotlib.lines.Line2D at 0x7f61e0d427f0>],
 'medians': [<matplotlib.lines.Line2D at 0x7f617c1a6640>],
 'fliers': [<matplotlib.lines.Line2D at 0x7f617c1a6910>],
 'means': []}
In [75]:
fig,ax = plt.subplots(figsize = (7, 5))

# zoom in for outliers regarding windspeed & humidity features
sns.boxplot(data = df_all[['windspeed','humidity']])
ax.set_title('Windspeed_Humidity outliers')
plt.show()
Out[75]:
<AxesSubplot:>
Out[75]:
Text(0.5, 1.0, 'Windspeed_Humidity outliers')
  • Replace and impute the outliers
In [76]:
from fancyimpute import KNN

# create dataframe for outliers
outliers = pd.DataFrame(df_all, columns=['windspeed','humidity'])

# replace outliers by n/a
columns = ['windspeed','humidity']
for i in columns:
    q75, q25 = np.percentile(outliers.loc[:,i],[75,25]) # Split data in 2 diff quantiles
    iqr = q75 - q25 # inter quantile range
    min = q25 - (iqr*1.5)
    max = q75 + (iqr*1.5) 
    outliers.loc[outliers.loc[:,i] < min, :i] = np.nan
    outliers.loc[outliers.loc[:,i] > max, :i] = np.nan

# impute outliers by using the average
outliers['windspeed'] = outliers['windspeed'].fillna(outliers['windspeed'].mean())
outliers['humidity'] = outliers['humidity'].fillna(outliers['humidity'].mean())
  • Replace the original dataset by imputated dataset
In [77]:
#Replacing the imputated windspeed
df_all['windspeed'] = df_all['windspeed'].replace(outliers['windspeed'])

#Replacing the imputated humidity
df_all['humidity'] = df_all['humidity'].replace(outliers['humidity'])
df_all.head(5)
Out[77]:
datetime season year month holiday weekday workingday weather_condition feel_temperature humidity windspeed total_count
0 2011-01-01 1 0 1 0 6 0 2 0.36 0.81 0.16 985
1 2011-01-02 1 0 1 0 0 0 2 0.35 0.70 0.25 801
2 2011-01-03 1 0 1 0 1 1 1 0.19 0.44 0.25 1349
3 2011-01-04 1 0 1 0 2 1 1 0.21 0.59 0.16 1562
4 2011-01-05 1 0 1 0 3 1 1 0.23 0.44 0.19 1600
  • Normal Probability Plot
In [78]:
fig = plt.figure(figsize=(15,8))
stats.probplot(df_all.total_count.tolist(), dist='norm',plot=plt)
plt.show()
Out[78]:
((array([-3.10612952, -2.83371839, -2.68121219, -2.57340905, -2.48915191,
         -2.41955673, -2.36001798, -2.30782877, -2.26125818, -2.21912992,
         -2.18060696, -2.14507173, -2.11205508, -2.08119197, -2.05219258,
         -2.0248228 , -1.99889075, -1.97423711, -1.95072808, -1.92825019,
         -1.90670633, -1.88601273, -1.8660966 , -1.84689427, -1.82834975,
         -1.81041348, -1.79304141, -1.77619419, -1.75983653, -1.74393663,
         -1.72846577, -1.71339788, -1.69870925, -1.68437825, -1.67038506,
         -1.6567115 , -1.64334086, -1.63025771, -1.6174478 , -1.60489794,
         -1.59259587, -1.58053022, -1.56869036, -1.55706641, -1.54564912,
         -1.53442983, -1.52340042, -1.51255328, -1.50188124, -1.49137757,
         -1.4810359 , -1.47085025, -1.46081495, -1.45092464, -1.44117426,
         -1.431559  , -1.4220743 , -1.41271583, -1.40347947, -1.39436132,
         -1.38535765, -1.3764649 , -1.36767969, -1.35899879, -1.35041911,
         -1.3419377 , -1.33355173, -1.32525852, -1.31705546, -1.30894008,
         -1.30091001, -1.29296295, -1.28509673, -1.27730922, -1.26959842,
         -1.26196238, -1.25439922, -1.24690714, -1.2394844 , -1.23212934,
         -1.22484033, -1.21761582, -1.21045431, -1.20335435, -1.19631454,
         -1.18933352, -1.18240998, -1.17554267, -1.16873035, -1.16197185,
         -1.155266  , -1.14861171, -1.14200789, -1.13545351, -1.12894754,
         -1.12248901, -1.11607697, -1.10971049, -1.10338867, -1.09711064,
         -1.09087556, -1.08468261, -1.07853098, -1.07241989, -1.06634859,
         -1.06031635, -1.05432244, -1.04836618, -1.04244688, -1.03656388,
         -1.03071654, -1.02490423, -1.01912634, -1.01338227, -1.00767144,
         -1.0019933 , -0.99634727, -0.99073283, -0.98514945, -0.9795966 ,
         -0.97407381, -0.96858056, -0.96311639, -0.95768082, -0.9522734 ,
         -0.94689368, -0.94154123, -0.93621562, -0.93091643, -0.92564325,
         -0.92039569, -0.91517335, -0.90997585, -0.90480282, -0.89965388,
         -0.89452869, -0.88942689, -0.88434814, -0.87929209, -0.87425842,
         -0.86924681, -0.86425694, -0.85928849, -0.85434116, -0.84941466,
         -0.84450869, -0.83962296, -0.83475719, -0.8299111 , -0.82508442,
         -0.8202769 , -0.81548825, -0.81071823, -0.80596659, -0.80123308,
         -0.79651745, -0.79181947, -0.78713889, -0.7824755 , -0.77782907,
         -0.77319937, -0.76858618, -0.76398929, -0.75940848, -0.75484356,
         -0.75029432, -0.74576055, -0.74124205, -0.73673864, -0.73225012,
         -0.72777631, -0.72331702, -0.71887206, -0.71444126, -0.71002444,
         -0.70562143, -0.70123206, -0.69685616, -0.69249356, -0.6881441 ,
         -0.68380762, -0.67948396, -0.67517297, -0.67087448, -0.66658836,
         -0.66231445, -0.6580526 , -0.65380267, -0.64956452, -0.64533801,
         -0.64112299, -0.63691932, -0.63272689, -0.62854555, -0.62437516,
         -0.62021561, -0.61606676, -0.61192849, -0.60780067, -0.60368318,
         -0.59957591, -0.59547872, -0.5913915 , -0.58731414, -0.58324652,
         -0.57918853, -0.57514005, -0.57110098, -0.5670712 , -0.56305061,
         -0.5590391 , -0.55503657, -0.55104291, -0.54705802, -0.5430818 ,
         -0.53911414, -0.53515496, -0.53120414, -0.5272616 , -0.52332724,
         -0.51940096, -0.51548267, -0.51157228, -0.5076697 , -0.50377484,
         -0.4998876 , -0.4960079 , -0.49213565, -0.48827077, -0.48441317,
         -0.48056276, -0.47671946, -0.4728832 , -0.46905388, -0.46523142,
         -0.46141575, -0.45760679, -0.45380445, -0.45000867, -0.44621936,
         -0.44243644, -0.43865984, -0.43488949, -0.43112532, -0.42736724,
         -0.42361519, -0.41986909, -0.41612887, -0.41239447, -0.40866581,
         -0.40494282, -0.40122544, -0.39751359, -0.39380721, -0.39010623,
         -0.38641059, -0.38272022, -0.37903506, -0.37535503, -0.37168008,
         -0.36801015, -0.36434516, -0.36068506, -0.35702979, -0.35337928,
         -0.34973347, -0.34609231, -0.34245573, -0.33882367, -0.33519607,
         -0.33157289, -0.32795405, -0.3243395 , -0.32072918, -0.31712304,
         -0.31352101, -0.30992305, -0.3063291 , -0.3027391 , -0.299153  ,
         -0.29557074, -0.29199227, -0.28841753, -0.28484648, -0.28127906,
         -0.27771521, -0.27415488, -0.27059803, -0.2670446 , -0.26349453,
         -0.25994779, -0.25640431, -0.25286405, -0.24932695, -0.24579297,
         -0.24226206, -0.23873416, -0.23520924, -0.23168723, -0.2281681 ,
         -0.22465178, -0.22113825, -0.21762744, -0.21411931, -0.21061382,
         -0.20711091, -0.20361054, -0.20011267, -0.19661724, -0.19312421,
         -0.18963354, -0.18614518, -0.18265908, -0.17917519, -0.17569349,
         -0.17221391, -0.16873641, -0.16526095, -0.16178749, -0.15831598,
         -0.15484638, -0.15137863, -0.14791271, -0.14444857, -0.14098615,
         -0.13752543, -0.13406635, -0.13060888, -0.12715297, -0.12369857,
         -0.12024565, -0.11679416, -0.11334407, -0.10989532, -0.10644788,
         -0.1030017 , -0.09955675, -0.09611298, -0.09267034, -0.08922881,
         -0.08578833, -0.08234887, -0.07891038, -0.07547282, -0.07203616,
         -0.06860034, -0.06516534, -0.0617311 , -0.05829759, -0.05486477,
         -0.0514326 , -0.04800103, -0.04457003, -0.04113955, -0.03770955,
         -0.03428   , -0.03085085, -0.02742207, -0.0239936 , -0.02056542,
         -0.01713748, -0.01370974, -0.01028217, -0.00685471, -0.00342734,
          0.        ,  0.00342734,  0.00685471,  0.01028217,  0.01370974,
          0.01713748,  0.02056542,  0.0239936 ,  0.02742207,  0.03085085,
          0.03428   ,  0.03770955,  0.04113955,  0.04457003,  0.04800103,
          0.0514326 ,  0.05486477,  0.05829759,  0.0617311 ,  0.06516534,
          0.06860034,  0.07203616,  0.07547282,  0.07891038,  0.08234887,
          0.08578833,  0.08922881,  0.09267034,  0.09611298,  0.09955675,
          0.1030017 ,  0.10644788,  0.10989532,  0.11334407,  0.11679416,
          0.12024565,  0.12369857,  0.12715297,  0.13060888,  0.13406635,
          0.13752543,  0.14098615,  0.14444857,  0.14791271,  0.15137863,
          0.15484638,  0.15831598,  0.16178749,  0.16526095,  0.16873641,
          0.17221391,  0.17569349,  0.17917519,  0.18265908,  0.18614518,
          0.18963354,  0.19312421,  0.19661724,  0.20011267,  0.20361054,
          0.20711091,  0.21061382,  0.21411931,  0.21762744,  0.22113825,
          0.22465178,  0.2281681 ,  0.23168723,  0.23520924,  0.23873416,
          0.24226206,  0.24579297,  0.24932695,  0.25286405,  0.25640431,
          0.25994779,  0.26349453,  0.2670446 ,  0.27059803,  0.27415488,
          0.27771521,  0.28127906,  0.28484648,  0.28841753,  0.29199227,
          0.29557074,  0.299153  ,  0.3027391 ,  0.3063291 ,  0.30992305,
          0.31352101,  0.31712304,  0.32072918,  0.3243395 ,  0.32795405,
          0.33157289,  0.33519607,  0.33882367,  0.34245573,  0.34609231,
          0.34973347,  0.35337928,  0.35702979,  0.36068506,  0.36434516,
          0.36801015,  0.37168008,  0.37535503,  0.37903506,  0.38272022,
          0.38641059,  0.39010623,  0.39380721,  0.39751359,  0.40122544,
          0.40494282,  0.40866581,  0.41239447,  0.41612887,  0.41986909,
          0.42361519,  0.42736724,  0.43112532,  0.43488949,  0.43865984,
          0.44243644,  0.44621936,  0.45000867,  0.45380445,  0.45760679,
          0.46141575,  0.46523142,  0.46905388,  0.4728832 ,  0.47671946,
          0.48056276,  0.48441317,  0.48827077,  0.49213565,  0.4960079 ,
          0.4998876 ,  0.50377484,  0.5076697 ,  0.51157228,  0.51548267,
          0.51940096,  0.52332724,  0.5272616 ,  0.53120414,  0.53515496,
          0.53911414,  0.5430818 ,  0.54705802,  0.55104291,  0.55503657,
          0.5590391 ,  0.56305061,  0.5670712 ,  0.57110098,  0.57514005,
          0.57918853,  0.58324652,  0.58731414,  0.5913915 ,  0.59547872,
          0.59957591,  0.60368318,  0.60780067,  0.61192849,  0.61606676,
          0.62021561,  0.62437516,  0.62854555,  0.63272689,  0.63691932,
          0.64112299,  0.64533801,  0.64956452,  0.65380267,  0.6580526 ,
          0.66231445,  0.66658836,  0.67087448,  0.67517297,  0.67948396,
          0.68380762,  0.6881441 ,  0.69249356,  0.69685616,  0.70123206,
          0.70562143,  0.71002444,  0.71444126,  0.71887206,  0.72331702,
          0.72777631,  0.73225012,  0.73673864,  0.74124205,  0.74576055,
          0.75029432,  0.75484356,  0.75940848,  0.76398929,  0.76858618,
          0.77319937,  0.77782907,  0.7824755 ,  0.78713889,  0.79181947,
          0.79651745,  0.80123308,  0.80596659,  0.81071823,  0.81548825,
          0.8202769 ,  0.82508442,  0.8299111 ,  0.83475719,  0.83962296,
          0.84450869,  0.84941466,  0.85434116,  0.85928849,  0.86425694,
          0.86924681,  0.87425842,  0.87929209,  0.88434814,  0.88942689,
          0.89452869,  0.89965388,  0.90480282,  0.90997585,  0.91517335,
          0.92039569,  0.92564325,  0.93091643,  0.93621562,  0.94154123,
          0.94689368,  0.9522734 ,  0.95768082,  0.96311639,  0.96858056,
          0.97407381,  0.9795966 ,  0.98514945,  0.99073283,  0.99634727,
          1.0019933 ,  1.00767144,  1.01338227,  1.01912634,  1.02490423,
          1.03071654,  1.03656388,  1.04244688,  1.04836618,  1.05432244,
          1.06031635,  1.06634859,  1.07241989,  1.07853098,  1.08468261,
          1.09087556,  1.09711064,  1.10338867,  1.10971049,  1.11607697,
          1.12248901,  1.12894754,  1.13545351,  1.14200789,  1.14861171,
          1.155266  ,  1.16197185,  1.16873035,  1.17554267,  1.18240998,
          1.18933352,  1.19631454,  1.20335435,  1.21045431,  1.21761582,
          1.22484033,  1.23212934,  1.2394844 ,  1.24690714,  1.25439922,
          1.26196238,  1.26959842,  1.27730922,  1.28509673,  1.29296295,
          1.30091001,  1.30894008,  1.31705546,  1.32525852,  1.33355173,
          1.3419377 ,  1.35041911,  1.35899879,  1.36767969,  1.3764649 ,
          1.38535765,  1.39436132,  1.40347947,  1.41271583,  1.4220743 ,
          1.431559  ,  1.44117426,  1.45092464,  1.46081495,  1.47085025,
          1.4810359 ,  1.49137757,  1.50188124,  1.51255328,  1.52340042,
          1.53442983,  1.54564912,  1.55706641,  1.56869036,  1.58053022,
          1.59259587,  1.60489794,  1.6174478 ,  1.63025771,  1.64334086,
          1.6567115 ,  1.67038506,  1.68437825,  1.69870925,  1.71339788,
          1.72846577,  1.74393663,  1.75983653,  1.77619419,  1.79304141,
          1.81041348,  1.82834975,  1.84689427,  1.8660966 ,  1.88601273,
          1.90670633,  1.92825019,  1.95072808,  1.97423711,  1.99889075,
          2.0248228 ,  2.05219258,  2.08119197,  2.11205508,  2.14507173,
          2.18060696,  2.21912992,  2.26125818,  2.30782877,  2.36001798,
          2.41955673,  2.48915191,  2.57340905,  2.68121219,  2.83371839,
          3.10612952]),
  array([  22,  431,  441,  506,  605,  623,  627,  683,  705,  754,  795,
          801,  822,  920,  959,  981,  985,  986, 1000, 1005, 1011, 1013,
         1027, 1096, 1096, 1098, 1107, 1115, 1162, 1162, 1167, 1204, 1248,
         1263, 1301, 1317, 1321, 1341, 1349, 1360, 1406, 1416, 1421, 1446,
         1450, 1461, 1471, 1472, 1495, 1501, 1510, 1526, 1529, 1530, 1536,
         1538, 1543, 1550, 1562, 1589, 1600, 1605, 1606, 1607, 1623, 1635,
         1650, 1683, 1685, 1685, 1693, 1708, 1712, 1746, 1749, 1787, 1795,
         1796, 1807, 1812, 1815, 1817, 1834, 1842, 1851, 1865, 1872, 1891,
         1913, 1917, 1927, 1944, 1951, 1969, 1977, 1977, 1985, 1996, 2028,
         2034, 2046, 2056, 2077, 2077, 2114, 2115, 2121, 2132, 2133, 2134,
         2162, 2169, 2177, 2192, 2209, 2210, 2227, 2236, 2252, 2277, 2294,
         2298, 2302, 2311, 2368, 2376, 2395, 2402, 2416, 2417, 2423, 2424,
         2424, 2425, 2425, 2429, 2431, 2432, 2455, 2471, 2475, 2485, 2493,
         2496, 2566, 2594, 2633, 2659, 2660, 2689, 2703, 2710, 2729, 2732,
         2739, 2743, 2744, 2765, 2792, 2802, 2808, 2832, 2843, 2895, 2913,
         2914, 2918, 2927, 2933, 2935, 2947, 2999, 3005, 3053, 3068, 3068,
         3071, 3095, 3115, 3117, 3126, 3129, 3141, 3163, 3190, 3194, 3204,
         3214, 3214, 3228, 3239, 3243, 3249, 3267, 3272, 3285, 3292, 3310,
         3322, 3331, 3333, 3348, 3351, 3351, 3368, 3372, 3376, 3387, 3389,
         3392, 3403, 3409, 3422, 3423, 3425, 3429, 3456, 3485, 3487, 3510,
         3520, 3523, 3542, 3544, 3570, 3574, 3577, 3598, 3606, 3613, 3614,
         3620, 3623, 3624, 3641, 3644, 3649, 3659, 3663, 3669, 3709, 3717,
         3727, 3740, 3744, 3747, 3750, 3761, 3767, 3777, 3784, 3784, 3785,
         3786, 3805, 3811, 3820, 3830, 3831, 3840, 3846, 3855, 3867, 3872,
         3873, 3894, 3907, 3910, 3915, 3922, 3926, 3940, 3944, 3956, 3958,
         3959, 3974, 3974, 3982, 4010, 4023, 4035, 4036, 4040, 4046, 4058,
         4066, 4067, 4068, 4073, 4073, 4075, 4086, 4094, 4097, 4098, 4098,
         4105, 4109, 4118, 4120, 4123, 4127, 4128, 4150, 4151, 4153, 4154,
         4169, 4182, 4186, 4187, 4189, 4191, 4195, 4195, 4205, 4220, 4258,
         4266, 4270, 4274, 4274, 4294, 4302, 4304, 4308, 4318, 4322, 4326,
         4332, 4333, 4334, 4338, 4339, 4342, 4352, 4359, 4362, 4363, 4367,
         4375, 4378, 4381, 4390, 4400, 4401, 4401, 4433, 4451, 4456, 4458,
         4459, 4459, 4460, 4475, 4484, 4486, 4492, 4507, 4509, 4511, 4521,
         4539, 4541, 4548, 4549, 4553, 4563, 4569, 4570, 4575, 4576, 4579,
         4585, 4586, 4590, 4592, 4595, 4602, 4608, 4629, 4630, 4634, 4639,
         4648, 4649, 4649, 4656, 4660, 4661, 4665, 4669, 4672, 4677, 4679,
         4687, 4694, 4708, 4713, 4714, 4717, 4725, 4727, 4744, 4748, 4758,
         4758, 4760, 4763, 4765, 4773, 4780, 4785, 4788, 4790, 4792, 4795,
         4803, 4826, 4833, 4835, 4839, 4840, 4844, 4845, 4862, 4864, 4866,
         4881, 4891, 4905, 4906, 4911, 4916, 4917, 4940, 4966, 4968, 4972,
         4978, 4985, 4990, 4991, 4996, 5008, 5010, 5020, 5026, 5035, 5041,
         5046, 5047, 5058, 5062, 5084, 5087, 5099, 5102, 5107, 5115, 5115,
         5117, 5119, 5119, 5130, 5138, 5146, 5169, 5170, 5180, 5191, 5191,
         5202, 5202, 5204, 5217, 5225, 5255, 5259, 5260, 5260, 5267, 5298,
         5302, 5305, 5312, 5312, 5315, 5319, 5323, 5336, 5342, 5345, 5362,
         5375, 5382, 5409, 5409, 5423, 5424, 5445, 5459, 5463, 5464, 5478,
         5495, 5499, 5501, 5511, 5515, 5531, 5532, 5538, 5557, 5558, 5566,
         5572, 5582, 5585, 5611, 5629, 5633, 5634, 5668, 5686, 5687, 5698,
         5698, 5713, 5728, 5729, 5740, 5743, 5786, 5805, 5810, 5823, 5847,
         5847, 5870, 5875, 5892, 5895, 5905, 5918, 5923, 5936, 5976, 5986,
         5992, 6031, 6034, 6041, 6043, 6043, 6053, 6073, 6093, 6118, 6133,
         6140, 6153, 6169, 6192, 6196, 6203, 6207, 6211, 6227, 6230, 6233,
         6234, 6235, 6241, 6269, 6273, 6290, 6296, 6299, 6304, 6312, 6359,
         6370, 6392, 6398, 6421, 6436, 6457, 6460, 6530, 6536, 6536, 6544,
         6565, 6569, 6572, 6591, 6591, 6597, 6598, 6606, 6624, 6639, 6660,
         6664, 6685, 6691, 6734, 6770, 6772, 6778, 6779, 6784, 6786, 6824,
         6824, 6825, 6830, 6852, 6855, 6857, 6861, 6864, 6869, 6871, 6879,
         6883, 6883, 6889, 6891, 6904, 6917, 6966, 6969, 6978, 6998, 7001,
         7006, 7013, 7030, 7040, 7055, 7058, 7105, 7109, 7112, 7129, 7132,
         7148, 7175, 7216, 7261, 7264, 7273, 7282, 7286, 7290, 7328, 7333,
         7335, 7338, 7347, 7350, 7359, 7363, 7375, 7384, 7393, 7403, 7410,
         7415, 7421, 7424, 7429, 7436, 7442, 7444, 7446, 7458, 7460, 7461,
         7466, 7494, 7498, 7499, 7504, 7509, 7525, 7534, 7534, 7538, 7570,
         7572, 7580, 7582, 7591, 7592, 7605, 7639, 7641, 7665, 7691, 7693,
         7697, 7702, 7713, 7720, 7733, 7736, 7765, 7767, 7804, 7836, 7852,
         7865, 7870, 7907, 7965, 8009, 8090, 8120, 8156, 8167, 8173, 8227,
         8294, 8362, 8395, 8555, 8714])),
 (1925.1274361641945, 4504.3488372093025, 0.9908084868276722))
  • Correlation Matrix
In [79]:
# using Pearson Correlation
df = df_all[df_all.columns]

plt.figure(figsize=(12,10))
cor = df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Blues)
plt.show()
Out[79]:
<Figure size 1200x1000 with 0 Axes>
Out[79]:
<AxesSubplot:>

Comments / reasoning:

  • Regarding first total_count box plot, I observe no outliers in total bike rentals in this dataset
  • Regarding windspeed & humidity box plot analysis, I only observe outliers in windspeed and humidity features in this dataset
  • Regarding probability plot, there are few target variable data points that deviates from normality
  • Regarding the correlation matrix, I observe a significant positive correlation between season_fall and feel_temperature and also the target variable with feel_temperature

Feature Engineering

In [80]:
season_type = pd.get_dummies(df_all['season'], drop_first = True)
season_type.rename(columns={2:"season_summer", 3:"season_fall", 4:"season_winter"},inplace=True)
season_type.head()

weather_type = pd.get_dummies(df_all['weather_condition'], drop_first = True)
weather_type.rename(columns={2:"weather_mist_cloud", 3:"weather_light_snow_rain"},inplace=True)
weather_type.head()
Out[80]:
season_summer season_fall season_winter
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
Out[80]:
weather_mist_cloud weather_light_snow_rain
0 1 0
1 1 0
2 0 0
3 0 0
4 0 0
In [81]:
# concatenate new dummy variables to df_all
df_all = pd.concat([df_all, season_type, weather_type], axis = 1)

# drop previous columns season & weathersit
df_all.drop(columns=["season", "weather_condition"],axis=1, inplace =True)
df_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   datetime                 731 non-null    datetime64[ns]
 1   year                     731 non-null    category      
 2   month                    731 non-null    category      
 3   holiday                  731 non-null    category      
 4   weekday                  731 non-null    category      
 5   workingday               731 non-null    category      
 6   feel_temperature         731 non-null    float64       
 7   humidity                 731 non-null    float64       
 8   windspeed                731 non-null    float64       
 9   total_count              731 non-null    int64         
 10  season_summer            731 non-null    uint8         
 11  season_fall              731 non-null    uint8         
 12  season_winter            731 non-null    uint8         
 13  weather_mist_cloud       731 non-null    uint8         
 14  weather_light_snow_rain  731 non-null    uint8         
dtypes: category(5), datetime64[ns](1), float64(3), int64(1), uint8(5)
memory usage: 36.9 KB

Questions and Answers

  1. Visualise rentals of bikes per day.
  2. Assume that each bike has exactly maximum 12 rentals per day.
    • Find the maximum number of bicycles nmax that was needed in any one day.
    • Find the 95%-percentile of bicycles n95 that was needed in any one day.
  3. Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)
In [82]:
# calculate rentals of bikes per day of the week
total_rents_by_day = df_all[['datetime', 'total_count']]
#total_rents_by_day

# visualize data
plt.figure(figsize = (30, 15))

fig = px.line(total_rents_by_day, x = 'datetime', y = 'total_count', title = 'Total Rentals per Day')
fig.show()
Out[82]:
<Figure size 3000x1500 with 0 Axes>
<Figure size 3000x1500 with 0 Axes>
In [83]:
# Max Number of Bikes = Total requested riders / Max no of rides per bike
# Max Number of Bikes = total_count / 12

df_all["total_count_max12"] = df_all["total_count"]/12
#df_all.head()

# calculate the maximum number of bicycles nmax that was needed in any one day
nmax = pd.Series(df_all["total_count_max12"])
print("The maximum number of bicycles nmax that was needed in any one day is", round(nmax.quantile(1, 'nearest'), 1), "!")

# calculate the 95%-percentile of bicycles n95 that was needed in any one day
n95 = pd.Series(df_all["total_count_max12"])
print("The 95%-percentile of bicycles n95 that was needed in any one day is", round(nmax.quantile(0.95, 'nearest'), 1), "!")
The maximum number of bicycles nmax that was needed in any one day is 726.2 !
The 95%-percentile of bicycles n95 that was needed in any one day is 631.7 !
In [84]:
a = list(range(1,101))
b = [scoreatpercentile(df_all["total_count"],i) for i in a]

df2 = pd.DataFrame({'percentile': a, 'total_count': b}, columns=['percentile', 'total_count'])
fig = px.line(df2, x = 'percentile', y = 'total_count', title = 'Distribution of the Covered Days Depending on the Number of Available Bicycles')
fig

Part 3 - Building prediction models¶

Tasks:

  1. Define a test metric for predicting the daily demand for bike sharing, which you would like to use to measure the accuracy of the constructed models, and explain your choice.
  2. Build a demand prediction model with Random Forest, preferably making use of following python libraries: scikit-learn.
  3. Report the value of the chosen test metric on the provided data.

Define Test Metric

Bike sharing demand prediction refers to the process of forecasting the number of bicycles that will be rented within a specific time period, aiding in resource allocation and system optimization. For predicting the daily demand for bike sharing, which is a regression model, since the target variable is a quantity (over time) and consequent model evaluation to access the performance of this forecasting model, I'll use various metrics to evaluate the performance:

  • mean absolute error (MAE)
  • root mean squared error (RMSE)
  • coefficient of determination (R-squared).

MAE and RMSE measure the average magnitude of the errors between the predicted and actual values.
R-squared measures the proportion of variance in the target variable, explained by the input variables.

Train a Regression Model

Next step is to train a Regression Model (in this case Random Forest), which will use the potentially predictive features we have identified to forecast the “total_count” label

In [85]:
# define training dataset
df = df_all.iloc[:-30, :]

df.columns
df.dtypes
df.head(2)
Out[85]:
Index(['datetime', 'year', 'month', 'holiday', 'weekday', 'workingday',
       'feel_temperature', 'humidity', 'windspeed', 'total_count',
       'season_summer', 'season_fall', 'season_winter', 'weather_mist_cloud',
       'weather_light_snow_rain', 'total_count_max12'],
      dtype='object')
Out[85]:
datetime                   datetime64[ns]
year                             category
month                            category
holiday                          category
weekday                          category
workingday                       category
feel_temperature                  float64
humidity                          float64
windspeed                         float64
total_count                         int64
season_summer                       uint8
season_fall                         uint8
season_winter                       uint8
weather_mist_cloud                  uint8
weather_light_snow_rain             uint8
total_count_max12                 float64
dtype: object
Out[85]:
datetime year month holiday weekday workingday feel_temperature humidity windspeed total_count season_summer season_fall season_winter weather_mist_cloud weather_light_snow_rain total_count_max12
0 2011-01-01 0 1 0 6 0 0.36 0.81 0.16 985 0 0 0 1 0 82.08
1 2011-01-02 0 1 0 0 0 0.35 0.70 0.25 801 0 0 0 1 0 66.75
  • Set target variable
In [86]:
# dump not needed columns
training_data = df.drop(['datetime', 'total_count_max12'], axis=1)

# move total_count as last column
training_data = training_data[ [ col for col in training_data.columns if col != 'total_count' ] + ['total_count']]
training_data.head()
Out[86]:
year month holiday weekday workingday feel_temperature humidity windspeed season_summer season_fall season_winter weather_mist_cloud weather_light_snow_rain total_count
0 0 1 0 6 0 0.36 0.81 0.16 0 0 0 1 0 985
1 0 1 0 0 0 0.35 0.70 0.25 0 0 0 1 0 801
2 0 1 0 1 1 0.19 0.44 0.25 0 0 0 0 0 1349
3 0 1 0 2 1 0.21 0.59 0.16 0 0 0 0 0 1562
4 0 1 0 3 1 0.23 0.44 0.19 0 0 0 0 0 1600
  • Split into train and test data
In [87]:
# split the dataset into the train and test data
X_train, X_test, y_train, y_test = train_test_split(training_data.iloc[:,0:-1], training_data.iloc[:,-1], test_size = 0.2, random_state = 0)

print('x train :', X_train.shape,'\t\tx test :', X_test.shape)
print('y train :', y_train.shape,'\t\ty test :', y_test.shape)
x train : (560, 13) 		x test : (141, 13)
y train : (560,) 		y test : (141,)
  • Split the features into categorical and numerical features
In [88]:
# create a new dataset for train attributes
train_attributes = X_train[X_train.columns]

# create a new dataset for test attributes
test_attributes = X_test[X_test.columns]

# split dataframe by numerical and categorical columns
num_cols = X_train.select_dtypes(include = ['uint8', 'int64', 'float64']).columns
cat_cols = X_train.select_dtypes(include = ['object', 'bool', 'category']).columns

print("There are {} numeric columns and {} categorical columns".format(len(num_cols), len(cat_cols)))
There are 8 numeric columns and 5 categorical columns
  • Decoding the training attributes
In [89]:
# get dummy variables to encode the categorical features to numeric
train_encoded_attributes = pd.get_dummies(train_attributes, columns = cat_cols)

print('Shape of transfomed dataframe:', train_encoded_attributes.shape)
train_encoded_attributes.head(2)
Shape of transfomed dataframe: (560, 33)
Out[89]:
feel_temperature humidity windspeed season_summer season_fall season_winter weather_mist_cloud weather_light_snow_rain year_0 year_1 month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 holiday_0 holiday_1 weekday_0 weekday_1 weekday_2 weekday_3 weekday_4 weekday_5 weekday_6 workingday_0 workingday_1
572 0.74 0.60 0.28 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1
45 0.25 0.31 0.29 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1
  • Decoding the training attributes
In [90]:
# training dataset for modelling
X_train = train_encoded_attributes
y_train = y_train #.total_count.values
In [91]:
# training the model
X_train = train_encoded_attributes
model = RandomForestRegressor(random_state = 0, n_estimators = 200)
In [92]:
# fit the trained model
model.fit(X_train, y_train)
Out[92]:
RandomForestRegressor(n_estimators=200, random_state=0)
  • Cross validation prediction

Cross-validation is used to estimate the performance of machine learning models, more specificaly, it is used to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited.

In [93]:
predict = cross_val_predict(model, X_train, y_train, cv=3)
In [94]:
# cross validation prediction plot
fig,ax = plt.subplots(figsize=(15,8))
ax.scatter(y_train, y_train-predict)
ax.axhline(lw=2,color='black')
ax.set_title('Cross validation prediction plot')
ax.set_xlabel('Observed')
ax.set_ylabel('Residual')
#plt.show()

# calculate equation for trendline
z = np.polyfit(y_train, y_train-predict, 1)
p = np.poly1d(z)

# add trendline to plot
plt.plot(y_train, p(y_train), color="lightgreen", linewidth=3, linestyle="--")
Out[94]:
<matplotlib.collections.PathCollection at 0x7f617a1e5cd0>
Out[94]:
<matplotlib.lines.Line2D at 0x7f617ec12100>
Out[94]:
Text(0.5, 1.0, 'Cross validation prediction plot')
Out[94]:
Text(0.5, 0, 'Observed')
Out[94]:
Text(0, 0.5, 'Residual')
Out[94]:
[<matplotlib.lines.Line2D at 0x7f617a1d1700>]
In [95]:
# R-squared scores
r2_scores = cross_val_score(model, X_train, y_train, cv=5)
print('R^2 scores :', np.average(r2_scores))
R^2 scores : 0.8567570676393114

Answers / comments / reasoning:

  • Observing cross validation prediction plot we see there is an apparent diagonal trend and the points where the predicted and actual values intersect generally follow the trend line. There is a good fitness of the model in this case, however it some data points present a higher variation. Normally if there's a variation, it represents the model’s residuals, which are the differences between the predicted label and the actual value of the validation label when the model applies the coefficients it learned during training to the validation data. By assessing these residuals from the validation data, we can estimate the level of error that can be expected when the model is used with new data for which the label is unknown.
  • The R-squared or coefficient of determination is ~ 85.7% on average for 5-fold cross validation, it means that predictor is only able to predict 85.7% of the variance in the target variable which is contributed by independent variables.
  • Decoding the test attributes
In [96]:
# get dummy variables to encode the categorical features to numeric
test_encoded_attributes=pd.get_dummies(test_attributes,columns=cat_cols)

print('Shape of transformed dataframe :', test_encoded_attributes.shape)
test_encoded_attributes.head(2)
Shape of transformed dataframe : (141, 33)
Out[96]:
feel_temperature humidity windspeed season_summer season_fall season_winter weather_mist_cloud weather_light_snow_rain year_0 year_1 month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 holiday_0 holiday_1 weekday_0 weekday_1 weekday_2 weekday_3 weekday_4 weekday_5 weekday_6 workingday_0 workingday_1
456 0.42 0.68 0.17 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0
675 0.28 0.57 0.17 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1

Model performance on test dataset

In [97]:
# predict the model
X_test = test_encoded_attributes
y_pred = model.predict(X_test)
In [98]:
# R-squared scores
r2_scores = cross_val_score(model, X_test, y_test, cv=5)
print('R^2 scores :', np.average(r2_scores))
R^2 scores : 0.8025713291399266

Model Optimization

In [99]:
# find best value for n_estimators
max = 0
index = -1
for i in range(10, 200):
    model = RandomForestRegressor(random_state = 0, n_estimators = i)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2_score = metrics.r2_score(y_test, y_pred)
    if r2_score > max:
        index = i
        max = r2_score
Out[99]:
RandomForestRegressor(n_estimators=10, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=11, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=12, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=13, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=14, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=15, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=16, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=17, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=18, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=19, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=20, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=21, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=22, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=23, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=24, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=25, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=26, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=27, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=28, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=29, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=30, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=31, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=32, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=33, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=34, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=35, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=36, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=37, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=38, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=39, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=40, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=41, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=42, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=43, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=44, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=45, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=46, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=47, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=48, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=49, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=50, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=51, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=52, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=53, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=54, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=55, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=56, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=57, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=58, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=59, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=60, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=61, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=62, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=63, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=64, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=65, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=66, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=67, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=68, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=69, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=70, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=71, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=72, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=73, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=74, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=75, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=76, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=77, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=78, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=79, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=80, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=81, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=82, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=83, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=84, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=85, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=86, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=87, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=88, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=89, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=90, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=91, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=92, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=93, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=94, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=95, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=96, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=97, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=98, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=99, random_state=0)
Out[99]:
RandomForestRegressor(random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=101, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=102, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=103, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=104, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=105, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=106, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=107, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=108, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=109, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=110, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=111, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=112, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=113, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=114, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=115, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=116, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=117, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=118, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=119, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=120, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=121, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=122, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=123, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=124, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=125, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=126, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=127, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=128, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=129, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=130, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=131, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=132, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=133, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=134, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=135, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=136, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=137, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=138, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=139, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=140, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=141, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=142, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=143, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=144, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=145, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=146, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=147, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=148, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=149, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=150, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=151, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=152, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=153, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=154, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=155, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=156, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=157, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=158, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=159, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=160, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=161, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=162, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=163, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=164, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=165, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=166, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=167, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=168, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=169, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=170, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=171, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=172, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=173, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=174, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=175, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=176, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=177, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=178, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=179, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=180, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=181, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=182, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=183, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=184, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=185, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=186, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=187, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=188, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=189, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=190, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=191, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=192, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=193, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=194, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=195, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=196, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=197, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=198, random_state=0)
Out[99]:
RandomForestRegressor(n_estimators=199, random_state=0)
In [100]:
# confirm same dimension for the target variable
y_test.shape
y_pred.shape
Out[100]:
(141,)
Out[100]:
(141,)
In [101]:
model_opt = RandomForestRegressor(random_state = 0, n_estimators = index)
model_opt.fit(X_train, y_train)
y_pred = model_opt.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
me = metrics.max_error(y_test, y_pred)

print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Max Error:', me)
Out[101]:
RandomForestRegressor(n_estimators=195, random_state=0)
Mean Absolute Error: 567.2639388979815
Mean Squared Error: 614874.5165653429
Max Error: 2562.769230769231
In [102]:
# R-squared scores
r2_scores = cross_val_score(model_opt, X_test, y_test, cv=5)
print('R^2 scores :', np.average(r2_scores))
R^2 scores : 0.8026568491315675

Answers / comments / reasoning:

  • After parameter tuning, the R-squared or coefficient of determination is the same (~80.3%), it means that this optimization did not enhance the results of the model, probably because we have limited and small dataset and also the difference in n_estimators was less than 3%. It would at least optimize the running time if dataset were bigger and less balanced.
In [110]:
# save the optimized trained model as a pickle file
#saved_model = pickle.dumps('model/model_opt_18092023')
  
# Load the pickled model
#rf_model_pkl = pickle.loads('saved_model')
  
# Use the loaded pickled model to make predictions
#rf_model_pkl.predict(X_test)
  • Residual plot
In [111]:
# residual scatter plot
fig, ax = plt.subplots(figsize=(15,8))
residuals=y_test-y_pred
ax.scatter(y_test, residuals)
ax.axhline(lw=2, color='black')
ax.set_xlabel('Observed')
ax.set_ylabel('Residuals')
ax.set_title('Residual plot')
#plt.show()

# calculate equation for trendline
z = np.polyfit(y_test, residuals, 1)
p = np.poly1d(z)

# add trendline to plot
plt.plot(y_test, p(y_test), color="lightgreen", linewidth=3, linestyle="--")
Out[111]:
<matplotlib.collections.PathCollection at 0x7f61da4a80a0>
Out[111]:
<matplotlib.lines.Line2D at 0x7f617d98f6d0>
Out[111]:
Text(0.5, 0, 'Observed')
Out[111]:
Text(0, 0.5, 'Residuals')
Out[111]:
Text(0.5, 1.0, 'Residual plot')
Out[111]:
[<matplotlib.lines.Line2D at 0x7f61dbd879a0>]

Predicting Bike Rental count on Daily basis, in Out-of-Sample Data

Out-of-sample testing is used to evaluate the performance of a strategy on a separate set of data that was not used during the development and optimisation process.
This helps to determine whether the strategy would be able to perform well on new, unseen data

In [112]:
# define out-of-sample dataset
df_last30 = df_all.tail(30)
df_last30.head()
Out[112]:
datetime year month holiday weekday workingday feel_temperature humidity windspeed total_count season_summer season_fall season_winter weather_mist_cloud weather_light_snow_rain total_count_max12
701 2012-12-02 1 12 0 0 0 0.36 0.82 0.12 4649 0 0 1 1 0 387.42
702 2012-12-03 1 12 0 1 1 0.46 0.77 0.08 6234 0 0 1 0 0 519.50
703 2012-12-04 1 12 0 2 1 0.47 0.73 0.17 6606 0 0 1 0 0 550.50
704 2012-12-05 1 12 0 3 1 0.43 0.48 0.32 5729 0 0 1 0 0 477.42
705 2012-12-06 1 12 0 4 1 0.26 0.51 0.17 5375 0 0 1 0 0 447.92
In [113]:
# save date variable
times = df_last30['datetime']

# dump not needed columns
testing_data = df_last30.drop(['datetime', 'total_count_max12'], axis=1)

# move total_count as last column
testing_data = testing_data[ [ col for col in testing_data.columns if col != 'total_count' ] + ['total_count']]
testing_data.head()
print('Shape of OOS dataframe :', testing_data.shape)
Out[113]:
year month holiday weekday workingday feel_temperature humidity windspeed season_summer season_fall season_winter weather_mist_cloud weather_light_snow_rain total_count
701 1 12 0 0 0 0.36 0.82 0.12 0 0 1 1 0 4649
702 1 12 0 1 1 0.46 0.77 0.08 0 0 1 0 0 6234
703 1 12 0 2 1 0.47 0.73 0.17 0 0 1 0 0 6606
704 1 12 0 3 1 0.43 0.48 0.32 0 0 1 0 0 5729
705 1 12 0 4 1 0.26 0.51 0.17 0 0 1 0 0 5375
Shape of OOS dataframe : (30, 14)
In [114]:
# create a new dataset for test attributes
testing_data_attributes = testing_data[testing_data.columns]

# split dataframe by numerical and categorical columns
num_cols = testing_data.select_dtypes(include = ['uint8', 'int64', 'float64']).columns
cat_cols = testing_data.select_dtypes(include = ['object', 'bool', 'category']).columns

print("There are {} numeric columns and {} categorical columns".format(len(num_cols), len(cat_cols)))

# get dummy variables to encode the categorical features to numeric
testing_data_encoded_attributes = pd.get_dummies(testing_data_attributes, columns=cat_cols)

# drop target variable
testing_data_encoded_attributes = testing_data_encoded_attributes.drop(['total_count'], axis = 1)

print('Shape of transformed dataframe :', testing_data_encoded_attributes.shape)
testing_data_encoded_attributes.head(2)
There are 9 numeric columns and 5 categorical columns
Shape of transformed dataframe : (30, 33)
Out[114]:
feel_temperature humidity windspeed season_summer season_fall season_winter weather_mist_cloud weather_light_snow_rain year_0 year_1 month_1 month_2 month_3 month_4 month_5 month_6 month_7 month_8 month_9 month_10 month_11 month_12 holiday_0 holiday_1 weekday_0 weekday_1 weekday_2 weekday_3 weekday_4 weekday_5 weekday_6 workingday_0 workingday_1
701 0.36 0.82 0.12 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0
702 0.46 0.77 0.08 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1
In [115]:
# make predictions
y_pred_testing = model_opt.predict(testing_data_encoded_attributes)
In [116]:
# submit final sample
Submission = pd.DataFrame({'datetime' : times, 'pred' : y_pred_testing})
Submission.set_index('datetime', inplace = True)
Submission.to_csv('output/sample_submission.csv')
Submission
Out[116]:
pred
datetime
2012-12-02 4151.09
2012-12-03 6463.96
2012-12-04 6707.82
2012-12-05 6322.25
2012-12-06 3339.76
2012-12-07 4379.96
2012-12-08 4517.68
2012-12-09 4122.96
2012-12-10 3984.20
2012-12-11 4633.91
2012-12-12 4518.00
2012-12-13 4418.54
2012-12-14 4437.31
2012-12-15 4741.36
2012-12-16 4373.70
2012-12-17 4273.49
2012-12-18 4951.97
2012-12-19 4913.23
2012-12-20 4577.46
2012-12-21 3727.81
2012-12-22 2820.97
2012-12-23 2944.27
2012-12-24 2836.15
2012-12-25 3348.11
2012-12-26 2815.71
2012-12-27 3002.12
2012-12-28 3216.13
2012-12-29 2669.78
2012-12-30 2853.44
2012-12-31 2881.16

Final Conclusions

Part 4 - Reflection / comments¶

Tasks: (Optional) Please share with us any free form reflection, comments or feedback you have in the context of this test task.

In summary, this notebook conducted a comprehensive analysis of daily bike rental data in two years time basis. Hands-on in data exploration, preprocessing, and feature engineering to prepare the data for modeling. The exploratory data analysis provided valuable insights into rental patterns based on different factors, such as weather, day of the week, seasonality, etc. In conclusion, this notebook gave us great insights on bike rental trends and successfully predicted rental counts using the Random Forest model.

Future improvements:

  1. Exploratory Data Analysis using hour dataset
  2. More feature engineering using day/hour dataset features
  3. Tuning other parameters inside RF
  4. Apply other models to compare performances
  5. Employing advanced feature selection/explainability techniques (eg. SHAP)

The analysis and insights presented here can provide valuable guidance for bike-sharing companies to optimize their services and meet the diverse preferences of their user base.

Submission¶

Please submit this notebook with your developments in .ipynb and .html formats as well as your requirements.txt file.

References¶

[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.